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For each constrained model, the prior distribution of the model parameters is formulated 
following the encompassing prior approach. Then, model selection is performed by using 
Bayes factors which are estimated by an importance sampling method. The approach 
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1 Introduction 



Though log-hnear models are frequently used for the analysis of contingency tables, they do 
not allow to express, and consequently test, several hypotheses that are usually of interest, 
mainly because lower order interactions do not refer to the marginal distributions to which 
they seem to refer. This motivated McCullagh & Nelder (1989, Section 6.5) to introduce a 
class of models in which the joint distribution of a set of categorical variables is parametrised 
through the highest log-linear interactions within each possible marginal distribution. Several 
other models have been proposed following the original idea of McCullagh & Nelder (see 
Glonek & McCullagh, 1995, Glonek, 1996, Colombi & Forcina, 2001, and Bergsma & Rudas, 
2002, and Bartolucci et al, 2007). 

In this paper, we deal with a flexible class of models in which: (i) the parameters of the 
saturated model are given by generalised logits, in the sense of Douglas et al. (1991), for 
each univariate marginal distribution, generalised log-odds ratios for each bivariate marginal 
distribution and similar interactions for each higher-order marginal distribution; (ii) any con- 
strained model may be formulated through linear equality and inequality constraints on such 
parameters. In this way we may express several hypotheses which are of special interest in 
presence of ordinal variables (see Bartolucci et al, 2001, and Colombi & Forcina, 2001), as 
for instance, that: (i) the marginal distribution of one variable is stochastically larger than 
that of another variable, provided that these have the same categories; (ii) a certain type of 
positive association between a pair of variables holds; (iii) the marginal distribution of one 
variable is stochastically increasing with respect to the level of an explanatory variable. 

For the above class of models we develop Bayesian inference and, in particular, a model 
selection strategy based on the Bayes factor (see Jeffreys, 1935, 1961, and Kass & Raftery, 
1995), which is defined as the ratio between the marginal likelihoods of two competing models. 
For the marginal models considered in this paper, the use of the Bayes factor allows us to easily 
compare two models parametrised through different types of logit, which would be otherwise 
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cumbersome using a likelihood ratio test. Moreover, since the Bayes factor is computed as 
the ratio between two marginal likelihoods, the presence of the nuisance parameters does not 
affect the inferential results, as it typically happens when using a likelihood ratio test (for a 
discussion on this point see Dardanoni & Forcina, 1998, and Bartolucci et al, 2001). On the 
other hand, the proposed approach requires the specification of a prior distribution on the 
parameters that is not required within the likelihood ratio approach. 

While the decision theoretic approach leads us to select the model with largest marginal 
likelihood, we can also use the Bayes factor as a measure of evidence. In order to assess this 
evidence we refer to the Jeffreys (1961) scale, which gives the following guideline: a log Bayes 
factor below 0.5 indicates poor evidence, between 0.5 and 1 substantial, between 1 and 2 strong 
and decisive evidence is provided by a Bayes factor larger than 2. 

Bayesian methods for the analysis of categorical data has been dealt with by several au- 
thors. For instance, Albert (1996, 1997) used the Bayes factor to test hypotheses such as in- 
dependence, quasi-independence, symmetry or constant association in two-way and three-way 
contingency tables. Dellaportas & Forster (1999) proposed a general framework for selecting 
a log-linear model through the Reversible Jump algorithm of Green (1995) under a multivari- 
ate Normal prior distribution on the parameters. In practice, both Albert (1996, 1997) and 
Dellaportas & Forster (1999) dealt with log-linear models obtained by imposing some linear 
equality constraints on the parameters of the saturated model, for example that a subset of 
the parameters is equal to zero. Klugkist and Hoijtink (2007), Hoijtink et al. (2008), Klugkist 
et al. (2005a, 2005b, 2010) and Wetzels et al. (2010) used, instead, the Bayes factor to com- 
pare competing models expressed through linear inequality and about equality constraints on 
the saturated model. Under their encompassing prior approach, the Bayes factor between a 
constrained model and the encompassing model reduces to the ratio of the probability that the 
constraints hold under the encompassing posterior distribution and the probability that they 
hold under the encompassing prior distribution. By encompassing model we mean a model, 
the parameter space of which includes that of every other model under consideration. There- 
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fore, once the prior distribution has been specified on the encompassing model parameters 
(encompassing prior), it is automatically specified for each submodel. 

The selection strategy we adopt for the class of models considered in this paper is related 
to the approach of Klugkist et al. (2010). We exploit their encompassing prior approach, 
which leads to a logically coherent assessment of prior and posterior model probabilities and 
parameters distributions, as well as an easy estimation of the Bayes factors. However, our 
work differs from that of Klugkist et al. (2010) mainly in three respects: (i) we consider 
a more general class of models for categorical data; (ii) we propose an importance sampling 
method to improve the efficiency of the Bayes factor estimates for models with very small prior 
and, possibly, posterior probabilities; (iii) we introduce an iterative algorithm to estimate the 
Bayes factor for models specified through about equality constraints, which does not require 
to sample from a constrained model parameter space. 

The paper is organised as follows. In Section 2 we describe the class of models of interest. 
Then in Section 3 we review the encompassing prior approach and we deal with Bayesian model 
selection. Finally, in Section 4 we illustrate the proposed approach through three applications 
involving some datasets of interest in the categorical data analysis literature. 

2 Marginal models for categorical variables 

In this section, we introduce the marginal models developed by McCullagh & Nelder (1989) 
and illustrate the parameterisation based on generalised logits and log-odd ratios. Then, we 
show how hypotheses of interest may be expressed through linear equality and inequality 
constraints imposed on the parameters of the saturated model. 

2.1 Parametrisation 

Let A = {Al, . . . , Ag) be a vector of q categorical variables and {1, . . . , rrii} be the support 
of Ai. Also let r = Hj'^j be the number of possible configurations of A and tt be the r- 
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dimensional column vector of the joint probabilities p{A = a) arranged in lexicographical 
order. In the following, we describe a saturated parameterisation of such a vector based on 
marginal logits, marginal log-odds ratios and similar higher-order interactions. 

Marginal logits may be of type local (/), global (g), continuation (c) or reverse continuation 
(r). For the i-th variable these are defined as follows for = 1, . . . , mj — 1: 

p{Ai = ai + 1) 



local: T]i{ai; I) = lo. 

global: r]i{ai; g) = log 

continuation: rji{ai]c) = log 

reverse continuation: t]i{ai]r) = log 



p{Ai = Qi) 

pjAj >ai + l)_ 

pjAj > + 1) _ 
p{Ai = ai) 

p{Ai = Qi + l) 



p{Ai < ai) 

Local logits are used when it is of interest to compare the marginal probability of each cate- 
gory with that of the previous category. Logits of type global and continuation are specially 
tailored to ordinal variables. In particular, logits of type global are more appropriate when 
the variable may be seen as a discretised version of an underlying continuum, whereas logits of 
type continuation are more appropriate when categories correspond to levels of achievement 
that may be entered only if the previous level has also been achieved, as in education. Finally, 
using logits of type reverse continuation is the same as arranging categories in reverse order 
and using logits of type continuation. 

Marginal log-odds ratios are defined as contrasts between conditional logits. For two 
variables, Ai and Aj, the most well-known log-odds ratios are shown in the following, where 
Oj = 1, . . . , mj — 1 and aj = 1, . . . , mj — 1: 

• Local: when logits of type / are used for both Ai and Aj 



r]ij{ai,aj;lj) = r]j{aj;l\Ai = ai + 1) - rij{aj;l\Ai = ai) 
= log 



p{Ai = Qi, Aj = aj)p{Ai = ai + 1, Aj = aj 



' p{Ai = tti + l, Aj = aj)p{Ai = ai, Aj = aj + 1) ' 



• Local- Global: when logits of type / are used for Ai and of type g for Aj 



r]ij{ai,aj-J,g) 



r]j{aj;g\Ai = + 1) - r]j{aj;g\Ai = ai) = 

pjAj = ttj, Aj < aj)p{Ai = + 1, Aj > aj + 1) _ 
°^ p{Ai = ai + 1, Aj < aj)p{Ai = ai, Aj > aj + 1) ' 



• Global: when logits of type g are used for both Ai and Aj 



r]ij{ai,aj]g,g) 



r]j{aj;g\Ai > + 1) - r]j{aj]g\Ai < ai) = 

p{Ai < ai, Aj < aj)p{Ai >ai + l,Aj> aj + 1) 
p{Ai > ai + 1, Aj < aj)p{Ai < ai, Aj > aj + 1) 



Similarly, three-way interactions are defined as contrasts between conditional log-odds ratios 
and so on for higher order interactions. 

Now let z be an r-dimensional vector of zeros and ones, let r]^ be a column vector containing 
all the marginal interactions between the variables Ai such that Zi = 1 and let r] be the vector 
obtained by stacking, in lexicographical order, the vectors rj^ one below the other for any 
2 7^ 0. Following Colombi & Forcina (2001), such a vector, which provides the saturated 
parametrisation of tt at issue, may be simply obtained as 



where C and M are appropriate matrices, whose construction is described in Appendix A. 
Note, however, that to invert equation (1), and so obtain tt in terms of rj, we must rely on a 
Newton-Raphson algorithm as the one described in Glonek & McCuUagh (1995) and Colombi 
& Forcina (2001). 

2.2 Constrained models 

A variety of constrained models may be formulated by posing linear equality and inequality 
constraints of the form 



rj = Clog(M7r), 




Et] = 0, C/r; > 0, 



(2) 



6 



on the saturated parameter vector. Here, and throughout the paper, equahty constraints are 
substituted by about equahty constraints of type \Er]\ < e, for a vector e > having suitably 
smah elements; this choice is motivated in Section 3.2.1. 

Consider first the case of only two variables, Ai and A2. The most interesting hypotheses 
are usually on the association between these variables. Let c = mi + m2 — 2 be the number 
of the marginal logits and d = (mi — l)(m2 — 1) be that of the log-odds ratios. By requiring 
that all the log-odds ratios are non-negative, namely by letting 

t/=(0,,e Id), (3) 

where Od,c is a matrix of d x c zeros and Id denotes an identity matrix of dimension d, we 
express the hypothesis of positive association between Ai and A2. Obviously, the type of 
association depends on the type of logit that is used for the two variables. For instance, with 
local logits for both variables we are formulating the hypothesis of Total Positivity of Order 2 
(TP2; see Karlin, 1968), whereas with global logits for both variables we are formulating the 
hypothesis of Positive Quadrant Dependence (PQD; see Lehmann, 1966). Note that there is 
a hierarchy among these notions of positive association in the sense that, for instance, TP2 
implies that all the continuation log-odds ratios are non-negative which, in turn, implies PQD 
(see Douglas et al. 1991, for details). Also note that, regardless of the type of log-odds ratio, 
independence between Ai and A2 may be expressed through the constraint that rather 
than U , is equal to the matrix in (3). A less stringent constraint than that of independence is 
the constraint of uniform association, namely that all the d log-odds ratios are equal to each 
other (Plackett, 1965). This type of constraint is formulated by letting 

^; = (o,_i,, Dd-i), 

where, in general, = {Oh^i Ih^i) — {Ih-i Oh~i), with 0^ = OhA, is a matrix that 
produces first differences. 

When Al and A2 have the same categories, the number of which is indicated by m, con- 
straints on the univariate marginal distributions may also be of interest. For instance, we may 
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formulate the constraint of marginal homogeneity by letting 

E = ( —Im~l Im.-l Om~l,d ) • 

When global logits are used for both variables, the constraint that A2 is stochastically greater 
than Ai may be imposed by letting U, rather than E, equal to the matrix above. When there 
are more than two variables, similar constraints may also involve interactions of order higher 
than two. These are typically of interest in the presence of longitudinal data. Some useful 
insights on how formulating a marginal model in these situations are provided by Glonek & 
McCullagh (1995, Section 5). 

The approach outlined above may be easily extended when dealing with one or more 
explanatory variables, which are collected in vector B. Every possible configuration of these 
variables, say b, defines a stratum conditionally on which we have a vector of joint probabilities 
of the response variables, 7r(b), and a vector of marginal parameters, ^/(b), defined as in (1). 
Obviously, in this setting we may impose the same constraints illustrated above within each 
stratum and also constraints involving the parameters of different strata. These constraints 
are still of the type Erj = 0, Urj > 0, but in this case by 77 we mean the vector obtained by 
stacking one below the other the vectors r]{b) for every b. For instance, when we only have 
two response variables, we may express the constraint of conditional independence between 
Ai and A2, given B, as 

E = I,^ { Od,c Id ) , 

where s is the number of strata, that is the number of different configurations of B, and (8) 
denotes the Kronecker product. We may also formulate the hypothesis that Ai and A2 have 
the same degree of association for each stratum, by letting 

E = Ds^ {Od,c Id), 

or that the explanatory variables do not affect the marginal distributions of Ai and A2, by 
letting 
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Finally, if we have only one explanatory variable, B, and this is ordinal, we can express the 
constraint that the marginal distributions of Ai and A2 increase with the level of B by letting 
U, rather than E, equal to the matrix above. 

3 Bayesian estimation and model selection 

In this section we show how to make inference on the models presented in the previous section. 
In particular. Section 3.1 is devoted to the issue of choosing appropriate priors for the class 
of models at hand, whereas Section 3.2 is focused on assessing the plausibility of the different 
models, given an observed contingency table. 

In the following, when the data are not stratified, we denote the frequency corresponding to 
the configuration a of such a table by ya and by y the vector with elements ya arranged as in 
TT. When the data are stratified according to one or more explanatory categorical variables, we 
have a vector of frequencies y{b) for every configuration of such variables and, consequently, 
y denotes the vector obtained by stacking one below the other the vectors y{b) for every b. 

3.1 Prior distributions 

In a Bayesian framework, it is natural to include equality and inequality constraints imposed 
on the model parameters as prior knowledge. Since all the constrained models presented in 
Section 2.2 are nested in an unconstrained or encompassing model, we use the concept of 
encompassing prior (Klugkist et ai, 2005a, 2005b; Klugkist & Hoijtink, 2007). Therefore, 
we specify the prior distribution only for an encompassing model and then we derive the 
prior distributions for the other models by restricting the parameter space according to the 
constraints of interest. This approach has the very nice interpretation that the resulting Bayes 
factor for model selection (see Section 3.2) coincides with the ratio between the proportions 
of the parameter space that are in agreement with the constrained model, under the posterior 
and the prior distributions of the encompassing model. This approach also has the advantage 
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that only one single prior distribution needs to be specified. Moreover, the method can be seen 
as a generalisation of the Savage-Dickey density ratio, which overcomes the Borel-Kolmogorov 
paradox (Dawid & Lauritzen, 2001). See Wetzels et al. (2010) for a detailed discussion on 
this point. 

In the present framework, it is natural to choose the saturated model based on the parame- 
ter vector TT as the encompassing model. Under this model, the parameter space is the simplex 
of dimension r and the frequency vector y has multinomial distribution with parameters n 
and TT. The choice of the tt parameterisation, rather than the parameterisation based on the 
vector of marginal parameters 77, is motivated by the fact that it also makes straightforward 
the comparison between different types of logit. 

Let Ml indicate the saturated (encompassing) model and let p{tz\Mi) denote the en- 
compassing prior distribution. The prior distribution of each constrained model M^, for 
k = 2, . . . , K, follows directly from this prior as 



where the integral is on the simplex of dimension r. Moreover, (5fe(7r) is the indicator function 
equal to 1 if tt is in accordance with the constraints defining model Mk and to otherwise 
and Cfc is the inverse of the proportion of the parameter space that, under the encompassing 
prior, is in agreement with these constraints. Obviously, the constrained prior in (4) is not 
defined for a model with equality constraints, but it is defined for a model with about equality 
constraints. 

Under the encompassing prior approach, also the posterior distribution of the parameters 
for each constrained model immediately follows from the posterior under the encompassing 
model. In particular, we have 



where, now, dk is the inverse of the proportion of the parameter space that, under the encom- 
passing posterior, is in agreement with the constraints of model M^. 







10 



Coming to the issue of choosing a distributional shape for the encompassing prior p(7r|Mi), 
the default prior for tt has been acknowledged to be the one in which tt has a uniform 
distribution on the simplex of dimension r or, equivalently, p{tt\Mi) ~ D{lr), where D{-) 
denotes the Dirichlet distribution and 1^ is a column vector of r ones. See, for instance, Tuyl 
et al. (2009) for a detailed discussion on this choice. 

The posterior for the saturated parameterisation tt, with the default prior choice, is readily 
derived and it is of type D{lr + y)- Therefore, samples can be drawn independently from the 
prior and posterior distributions for the saturated model and the corresponding normalising 
constants are available in closed form. 

3.2 Model Selection 

Let Ai = {Ml, . . . , Mk} denote the set of models of interest. As already noted, each of these 
models is defined by a certain type of logit for every response variable and by constraints of 
type (2) on the vector of marginal parameters, with the exception of model Mi which is the 
saturated model. Then, M2, . . . , Mk are all nested in Mi, but not necessarily nested in one 
another. 

For model selection, we make use of the Bayes factor, which is the ratio of the marginal 
likelihoods of two competing models. Thus, the Bayes factor for model Mk versus the encom- 
passing model is defined as: 

_ p{y\Mk) _ Jp{y\7z,Mk)p{7r\Mk)d'n: 
p{y\M,) Jp{y\7z,M^)pi7^\M,)dn' 

where p{y\TT, M^) and p{y\Mk) denote, respectively, the likelihood of the data and the marginal 
likelihood for model M^. The Bayes factor measures the evidence that the data provide for one 
model versus the other and correponds to the fold change from prior model odds to posterior 
model odds. In this paper we always use a 0-1 loss. Obviously, the larger is Bki-, the greater 
is the evidence provided by the data in favour of Mk with respect to Mi (see Kass & Raftery, 
1995). So, when Bki is larger than 1, or equivalently \og{Bki) > 0, model Mk has to be 
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preferred to model Mi. To compare more than two models, or equivalently to choose the best 
model in Ai when K > 2, a convenient possibility is to single out Mi as the reference model 
and then compute the Bayes factor between every other model and the unconstrained one, 
that is i?fci, for k = 2, . . . , K. The model to be preferred is that with the largest Bayes factor, 
provided that it is larger than 1; otherwise the best model is the saturated model. Obviously, 
the Bayes factor for comparing every pair of models M^ and M/, not necessarily nested, is 
straightforwardly computed as B^i = Bki/Bn. 

It is important to note that the Bayes factor, as model selection tool, combines goodness 
of fit with a correction for model complexity. 

3.2.1 Computational issues in estimating the Bayes factor 

Direct computation of the Bayes factor is almost always infeasible, and this also happens for 
the class of models dealt with here. Several methods have been proposed to estimate the 
Bayes factor numerically, but the estimation is generally cumbersome from the computational 
point of view. 

The encompassing prior approach renders a nice interpretation of the Bayes factor for 
a constrained model M^ with the encompassing model Mi, which virtually eliminates the 
computational complications inherent in Bayes factor estimation. In fact, as demonstrated 
in Klugkist et al. (2005a), the Bayes factor for a constrained versus the encompassing model 
reduces to the ratio of the proportions of the parameter space that are in agreement with the 
constrained model under the posterior distribution and prior distribution of the encompassing 
model. Thus, the Bayes factor for a constrained model M^ with respect to the encompassing 
model Ml is 

Bui = ^. (6) 

In the light of (6), estimating the Bayes factor is particularly simple. The encompassing 
prior is sampled and Ck is estimated by Ck, which is the inverse of the proportion of the sample 
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that is in agreement with the constraints defining model M^. Similarly, sampling from the 
encompassing posterior allows us to estimate as d^, which is the inverse of the proportion 
of the sample that is in agreement with the constraints of model M^. In this way, using just 
one sample from the encompassing prior and another from the encompassing posterior, the 
estimate 

A _ Cfc 
-Dfcl — -J- 

dk 

can be computed for each constrained model M^, k = 2, . . . , K. 

Notice that, in our setting, the choice of the Dirichlet default prior for tt allows to sam- 
ple independently under both the encompassing prior and posterior distributions, leading to 
further simplifications in estimating the Bayes factor. Moreover, in some cases, can be 
computed exactly, without the need of sampling from the encompassing prior. However, there 
are two issues that we must deal with when estimating the Bayes factor. 

First of all, a rare event problem can arise. Consider for instance our example in Section 
4.1: we have a six by six contingency table with 35 free parameters under the unconstrained 
model. The hypothesis of positive association is formulated by requiring the positivity of the 
25 log-odds ratios. When using logits of type / for both variables, the constant c^^ for a positive 
association model can be calculated exactly as 0.5^5 = 2.9802 x 10"^ In this case, sampling 
from the encompassing prior is not required, but such a small values of c^^ can be common 
to other models. For these models, even if we drew millions of values from the encompassing 
prior, we would expect to see no values satisfying the constraint. This would lead an estimate 
of the Bayes factor equal to oo or to oo/oo, in case the same problem also arises when sampling 
from the encompassing posterior. In general, even if a finite estimate of the Bayes factor can be 
achieved, its variance would be huge for those constrained models characterised by a very small 
proportion of the parameter space in agreement with the constraints under the encompassing 
prior and, possibly, posterior distribution. 

The problem described above is that of rare event simulation (e.g., Bucklew, 2004), which 



13 



is often overcome through importance sampling. Suppose we want to estimate 1/cfc. From 
(4) it immediately follows that l/c^ = Ep{6k{T7)), where the expected value is calculated with 
respect to the encompassing prior p(7r|Mi). Now, letting (7(77) be any other density such that 
p{7r\Mi) = whenever (7(77) = 0, we can re-write 

Ep{6k{7T)) = J p{n\A'h)6k{7T)d7T = J 

where the last expected value is now calculated under the importance density 5'(7r). Then, 
an importance sampling estimate of l/c^ can be obtained by sampling tt from an appropriate 
importance density gi^T^) and estimating the last expected value in (7) through the sample 
mean. If required, an estimate of 1/dk can be obtained in a similar way, after choosing an 
appropriate sampling density. 

In this paper, we propose an automatic way to obtain an adequate importance sampling 
density. Suppose we want to estimate l/d^ for a certain model M^; then the proposed method 
is based on the following steps: 

(i) compute the maximum likelihood estimate of the vector r] under the constraints imposed 
by model (see Colombi & Forcina, 2001) and indicate this estimate by •fj; 

(ii) convert fj into tt using the Newton-Rapson algorithm described in Glonek & McCuUagh 
(1995) and Colombi & Forcina (2001); 

(iii) choose as importance density a Dirichlet distribution with mean vector equal to tt, that 
is (7(77) ~ D{aTt), where a is a tuning parameter that can be appropriately chosen 
so that enough draws from the importance density satisfy the constraints imposed by 
model Mfc. The optimal tuning parameter could be chosen by minimizing the variance 
of the approximation, but this expression depends itself on the target quantity. A simple 
approach, which we use in this paper, is to try different values on a suitable grid (say 
from 0.02 to 50). 



p(7r|M: 



^4(7r) 



5((7r)d7r = Eg 



4(77 



^p(7r|Mi 



, (7) 
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The same strategy can be adopted for choosing an appropriate importance density to estimate 
1/cfc. In this case, the maximum hkehhood estimate i) in (i) will be that corresponding to 
a hypothetical contingency table having a vector of frequencies y with all elements equal to 
zero. 

To give an idea of the precision of the algorithm, we consider again the above mentioned 
example in Section 4.1. For those data, we compare the true value of l/c^, exactly computable 
for the TP2 model, with its estimates obtained in three separate runs of the algorithm. The 
results, which are given in Table 1, show that the approximation is rather satisfactory in all 
cases. 

True value Estimate #1 Estimate 7^2 Estimate #3 
2.9802 X IQ-^ 2.2315 x IQ-^ 2.8510 x IQ-^ 3.5776 x 10"^ 

Table 1: True and estimated 1/ck for model TP2 on data in Section 4-i- 

The second issue in estimating the Bayes factor arises in the presence of about equality 
constraints. As already noted, for models formulated accordingly to strict equality constraints, 
the Bayes factor cannot be interpreted as the ratio between the proportions of encompassing 
posterior and prior in agreement with the constraints, since these proportions would be exactly 
zero. However, it has been recently shown (Wetzels et ai, 2010) that the encompassing 
approach naturally extends to exact equality constraints by considering the ratio of the heights 
for the encompassing posterior and prior distributions evaluated under the constraint (i.e., 
the Savage-Dickey density ratio). However, this approach to handle hypotheses specified 
through exact equality constraints complicates the computation of the Bayes factor for models 
containing both equality and inequality constraints. For this reason, we rather preferred to 
follow the idea of Berger and Sellke (1987) and Klugkist et al. (2010) of substituting exact 
equality with about equality constraints. In this way, the interpretation of the Bayes factor 
provided in (6) is preserved and models containing inequality or about equality constraints, as 
well as a mix of both constraints, can be handled in a unified manner. Moreover, Berger and 
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Delampady (1987) noted that a Bayes factor based on equality constraints is indistinguishable 
from a Bayes factor based on about equality constraints, provided that the interval around 
the exact equality constraint is small enough. However, if this interval is too small, we incur 
again in the rare event problem illustrated above, when trying to estimate c^^ and c?^^. 

To solve the above problem, Klugkist et al. (2010) proposed a stepwise procedure which 
guarantees that a small enough interval is used and does not actually need to pre-specify 
the size of this interval. In principle, we could use the method of Klugkist et al. (2010) to 
estimate the required constants Cf and dt. This method is based on drawing random numbers 
from suitably truncated Gamma distributions, which are then normalised to obtain the vector 
77. The way in which the support of these distributions is chosen depends on the adopted 
constraints. In our case, however, the complexity of the constraints implies that it is difficult 
to define how the support of each of these variables must be constrained; on the other hand, 
a rejection sampling procedure to draw random values from the truncated normal would be 
rather slow. For these reasons, we prefer to adapt the iterative procedure in Klugkist et al. 
(2010) exploiting, again, the importance sampling method. According to our procedure, only 
two different samples, one drawn from the importance density for the prior and the other 
one from the importance density for the posterior, are required to estimate the Bayes factor, 
thus overcoming the problem of sampling from constrained distributions, which affects the 
procedure in Klugkist et al. (2010). The details of the corresponding algorithm are given in 
Appendix B. 

Coming to the issue of parameter estimation, we need to acknowledge that, for the class 
of models considered here, obtaining point or interval estimates of the parameters is not 
in general of great interest. The main interest rather lies in model selection as a tool for 
evaluating which hypothesis is mostly supported by the data. Nevertheless, once a particular 
model Mk has been selected for the data at hand, Bayesian parameter estimation is based 
on the posterior distribution of the model parameters and, in our setting, a sample from this 
posterior is already available after model choice as a byproduct of the procedure to estimate 
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dk- In particular, we take the set of all the draws from the parameter posterior distribution, 
D{lr + y), of the saturated model that are in agreement with the constraints imposed by 
model Mfc. This set should contain enough draws to be also used for parameter estimation 
purposes. 

Obviously, parameter estimates can be obtained in the way described above for models de- 
fined by about equality constraints but not for models defined by exact equality constraints. 
As an alternative, if estimates under exact equality constraints are required, the parameter- 
isation in 77 can be used, after choosing an appropriate prior distribution on this parameter 
vector, for example a Gaussian distribution. However, as already noticed, such a parameteri- 
sation would complicate model selection in the presence of models expressed through different 
types of logit. 

4 Applications 

In the following, we illustrate the proposed approach through three applications involving 
some interesting datasets which also include explanatory variables. In the first application, 
illustrated in Section 4.1, we also propose an analysis of sensitivity with respect to the prior 
specification. 

4.1 Classification of men by social class and social class of their 
fathers 

We first consider a dataset (see Table 2) referred to a sample of British males cross-classified 
according to their occupational status {A2) and that of their father (Ai). 
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Table 2: Father (Ai) and son (A2) occupational status for a sample of 3,488 British males. 

The data have been already analysed by other authors, such as Goodman (1991) and Dar- 
danoni & Forcina (1998). In particular, Dardanoni & Forcina (1998), following a likelihood 
ratio approach, concluded that the data conform to some forms of positive association. How- 
ever, due to presence of nuisance parameters, given by marginal column probabilities, they 
did not reach a definitive conclusion about TP2. 

For these data we first compared the saturated model (Mi) with the independence model 
(M2), the saturated model incorporating PQD (M3) and that incorporating TP2 (M4). For 
each of the latter three models we estimated the Bayes factor with respect to the saturated 
model, taken as reference model, through the algorithm described in Section 3.2. We obtained 
the following results: 

iog(4i) iog(4i) iog(4i) 

-34.88 4.32 5.12 
In order to give to the reader an idea of the computational details, we point out that to 
compute B21 (see Section 3.2.1) we used a = 20 for the prior and a = 1 for the posterior, where 
as importance density we used a Dirichlet with parameters corresponding to the independence 
model itself. Since we have about equality constraints, we have used the algorithm in Appendix 
B starting from e = 0.1, with tuning parameter b = 0.5. The algorithm has stopped after 
two iterations, hence with e = 0.025. The approximation has been replicated B = 100 times 
and we found it fairly stable (we obtained a standard deviation of the 100 replicates smaller 
than 2). We report the average estimate, which can be seen as a single estimate obtained 



18 



from a concatenated sample. The other two Bayes factors do not involve about equality, 
but only inequality constraints. For the case of PQD (i.e., B^i) we did not need importance 
sampling because sampling directly from the prior and posterior gives a large number of 
samples satisfying the constraints. 

The hypothesis of independence must be definitely rejected, whereas that of positive asso- 
ciation may be accepted. In particular, the model incorporating TP2, formulated by requiring 
that all the local log-odds ratios are non negative, has to be preferred to that incorporating 
PQD, which is formulated through global log-odds ratios. This means that the data con- 
form to the strongest notion of positive association among those considered by Douglas et al. 
(1991). Hence, we can state that sons coming from a better family have a higher chance of 
success also conditional on remaining within any given subset of neighbouring classes. On 
the other hand, the hypothesis of uniform association has to be rejected since, comparing the 
model incorporating this constraint in addition to TP2 (M5) with model M4, we obtained 
log(S54) = -28.01. 

In order to perform some sensitivity analysis, we also calculated the Bayes factors in the 
table above under other three different Dirichlet prior parameters, obtaining the following 
results: 

log(4i) log(4i) log(4i) 

D(0.51^) -34.49 4.26 5.04 
D{21r) -34.22 4.36 5.26 
D{51r) -32.18 4.39 6.04 

It can be seen that there is only a slight sensitivity to prior assumptions for these data. 

By varying the prior we do not reach different conclusions with respect to model choice. We 

obtained similar results, not reported here, for the other Bayes factors computed in this section 

and in the next two sections. 

Moving back to the data, we also considered some constraints on the marginal distributions 

of the response variables. In particular, we considered model Mg, formulated by incorporating 

in M4 the constraint that the marginal distributions of Ai and A2 are equal, and model 



19 



My, by incorporating in M4 the constraint that every local logit of A2 is greater than the 
corresponding local logit of Ai, this in turn implies that the marginal distribution of A2 is 
stochastically greater than that of Ai. The Bayes factors of these two models with respect to 
M4 are: 

log(44) l0g(^74) 

-0.78 2.22 

Model Mf seems to be supported by the data. This means that we can observe not only pure 
mobility, that is positive association between family's origin and the son's status, but also 
structural mobility, which instead refers to how far apart the two marginal distributions are 
and is essentially related to socioeconomic growth. 

4.2 Classification of elderly people by Alzheimer's disease and cog- 
nitive impairment 

The second dataset we analysed (see Table 3) is referred to a sample of elderly people cross- 
classified by Alzheimer's disease (Ai) and cognitive impairment (^2); stratified by age (B); 
the data are taken from Agresti (1990, p. 298). The levels of Ai are: (IV) highly probable; 
(III) probable; (II) possible; (I) unaffected; the levels of A2 are: (V) severe; (IV) moderate; 
(III) mild; (II) borderline; (I) unaffected. 
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Table 3: Alzheimer's disease (Ai) and cognitive impairment (A2) for a sample of 513 elderly 
people, stratified by age (B: less than 15, more than 15). 

As the categories of both response variables are in reverse order, we based our analysis 
on reverse continuation logits. In this setting, we compared the saturated model (Mi) with 
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the model of conditional independence (M2) and the saturated model incorporating positive 
association in every stratum (M3). The estimated Bayes factors are: 

l0g(^2l) l0g(4l) 

-6.31 4.76 

The hypothesis of conditional independence is not supported by the data, whereas that of 
positive association in each stratum is strongly supported. This means that, also conditionally 
on the age, worst diagnoses of Alzheimer's disease are associated with most severe cognitive 
impairment for both. 

Then, we tried to test further hypotheses on the association between the two response 
variables. In particular, we considered the following constraints: (i) the level of the association 
is the same in each stratum; (ii) the association is stronger in the first stratum; (iii) the 
association is stronger in the second stratum. The models obtained by incorporating these 
hypotheses in M3 are denoted, respectively, by M4, M5 and Mg. The estimated Bayes factors 
for these models with respect to M3 are: 

log(^43) log(43) log(43) 

-4.26 -22.40 8.12 
A certain amount of evidence in favour of model Mg is noted. Using this as reference model, 

further constraints on the marginal distributions can be added, such as: the marginal distri- 
bution of Ai increases, namely the reverse continuation logits decreases, with age (M7); the 
marginal distribution of A2 increases with age (Mg); both marginal distributions increase with 
age (Mg). We have the following results: 

log(^76) log(^86) lQg(^96) 

6.37 17.17 22.14 
Models M7, Mg and Mg seem to be all compatible with the data, therefore we chose the one 

with the highest Bayes factor as the most plausible one, that is model Mg. This implies that, 

as age increases, individuals are more likely to have a serious level of cognitive impairment 

and to be diagnosed the Alzheimer's disease with a higher degree of confidence. Therefore, 

age does not only affect the association between the two response variables, that is stronger 

for elder people, but also shows a direct effect on their marginal distributions. 
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4.3 Clinical trial for skin disorder 

The dataset in Table 4, already analyzed by Glonek & McCuUagh (1995) and Koch et al. 
(1991), refers to a clinical trial which, for confidentiality, was fictitiously described as pertaining 
to the treatment of a skin disorder. The 72 subjects in the study are divided into two groups, 
the first one receiving the treatment and the second one receiving the placebo. An ordinal 
response variable, with levels poor/fair (I), good (II) and excellent (III), was recorded for each 
subject on four different occasions: 3 days (^i), 7 days (^42), 10 days (^3) and 14 days (^2) 
after treatment. Given the nature of the response variables, it is natural to use global logits. 
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Table 4: Response to treatment over time (A\, A2, A3, A4) for a sample of 72 subjects, 



stratified by type of treatment (B). 

The data are very sparse, as 128 of the 162 cells are empty. Therefore, following Glonek 
& McCullagh (1995), the largest model we considered is a reduced model in which all the 
interactions of order higher than two are set equal to zero. Such a model, that we indicate by 
M2, is less restrictive than the largest model considered by Glonek & McCullagh (1995) which 
assumes that the association between every pair of response variables is the same in the two 
strata. Note that Mi, the saturated model, is still used as a reference model to calculate the 
Bayes factors but is not included in the set of models under choice. 

We first compared model M2 with the model that Glonek & McCullagh (1995) chose as final 
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model (M3). The latter is based on the following constraints: (i) there is uniform association 
within any stratum and between the strata; (ii) there is a constant shift between marginal 
logits over time and between strata. Comparing this model, with model M2, we obtained 
log(i?32) = 0.19. There is a very mild evidence in favor of M3. Finally, using M3 as reference 
model, we considered model M4 obtained from M3 by incorporating PQD and the constraint 
that the marginal distribution of each response variable is stochastically greater for the second 
stratum than for the first one. These hypotheses seem to be supported by the data, as we 
have log(E43) = 2.38. 
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Appendix 

A: Transformation from tt to 77 

The matrices C and M in (1) may be obtained as follows. C is a block diagonal matrix with 
blocks C zi ordered as 77^ in 77, given by 

i=l 

where Ci = 1 ii Zi = 1 and Ci = {Irm^i —Im,-i) otherwise. Similarly, M has blocks of 
columns given by 

Q 

= (g) M„ 

1=1 
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where Mi = 1'^. ii Zi = 0; otherwise, we have 

Inii—l Orrti — 1 
, Om^ — 1 I rrii — l 



O-mi-l 

Trai ~ 1 Orrti — 1 
Omi — 1 Inii — l 

where T/j is a h x h lower triangular matrix of ones 



if logits of type / are used for the z-th variable, 
if logits of type g are used for the z-th variable, 
if logits of type c are used for the i-th variable, 
if logits of type r are used for the z-th variable. 



B: Computing Bayes Factors with about equality constraints 

First of all, we recall that about equality constraints are specified as |£^r7| < e, for a small 
e > 0. If e is too large, the corresponding Bayes factor is far from the Bayes factor which would 
be obtained with precise equality constraints. If e is too small, estimates of the proportion of 
the encompassing prior and encompassing posterior in agreement with the constraints may be 
inefficient. 

In order to fix a suitable value for e, we adapt the iterative procedure of Klugkist et 
al. (2010). Suppose we want to estimate B^i for the constrained model versus the 
encompassing model, where the constrained model is subject to \Eri\ < e and, possibly, 
Ur} > 0. Our procedure comprises the following steps: 

1. choose a small value ei and define M^ i as the model in which e is put equal to ei; 

2. estimate B(^k.i)i = Ck.i/dk.i, where c^^ and dl^^ are, respectively, the proportions of the 
sample from the encompassing prior and posterior distributions in agreement with the 
constraints imposed by i; 

3. define €2 = bsi, with < 6 < 1, and Mk.2 as the model Mk in which e is put equal to £2; 
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4. estimate i?(fc.2)(fc.i) 



{ck.2/ dk.2) / {ck.i/ dk.i), where Cf^\ and dj^\ are, respectively, the 



proportions of the samples from the encompassing prior and posterior in agreement with 
the constraints imposed by Mk.2- 

Repeat steps 3 and 4, with each e„+i = 6e,„, until the condition -B{fc.n+i){fc.n) ~ 1 is not satisfied. 
Then the required Bayes factor estimate Bki can be calculated by multiplication: 



In the limit (i.e., when e„ — )• 0), this method yields the estimate of the Bayes factor for model 
Mfc with exact equality constraints versus the encompassing model. 

Notice that, in the procedure above, the problem of getting inefficient estimates for the 
proportion of encompassing prior and posterior in agreement with the constraints is solved by 
using the importance sampling approach described in Section 3.2.1. Thus, only two different 
samples, one drawn from the importance density for the prior and the other one from the 
importance density for the posterior, are required to compute all the Bayes factor estimates 



Bkl = X i?(fc.2)(fc.l) X ■ ■ ■ X B{^k.n)(k.n~l)- 



(8) 
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